Intro to Visualization in Python - Exploratory data analysis - 1
One should look for what is and not what he thinks should be. (Albert Einstein)
Exploratory Data Analysis: Topic introduction
In this part of the course, we will cover the following concepts:
- Exploratory data analysis use cases
- Perform EDA on data
Module completion checklist
| Objective | Complete |
| --- | --- |
| Discuss data visualization and exploratory data analysis | |
| Describe chart types by data and form | |
The Challenger explosion example
- The 1986 Space Shuttle Challenger explosion is an emblematic case study of how data visualization can play an essential role in decision-making
- The explosion happened because low launch-day temperatures compromised the O-ring seals in a solid rocket booster
- Edward Tufte, a visualization expert, argues that the cause of this tragedy was an unreadable format of data given to decision-makers
The original visualization of the temperature
- The chart below was presented to the experts at the time
- How easily interpretable do you think it is?
![centered]()
The revised visualization of the temperature
- Edward Tufte argues a better chart may have prevented disaster
- How easily interpretable do you think the revision he created is?
![centered]()
Data Visualization
- Data visualization is an attempt to make data more easily digestible by rendering it in a visual context (e.g., charts and graphs)
- We use data visualization to transform raw data into something compelling
- Data visualization is at the intersection of art and science
Why visualize data?
- Visual context provides insights on patterns, trends, and correlations that might be difficult to detect otherwise
- It is a simple way to convey concepts and provide visual access to large amounts of complex data
- Python is well suited to this because it has multiple graphing libraries with many valuable features
Why build a visualization?
- To provide valuable, interpretable, and relevant insights
- To give a visual or graphical representation of data or concepts
- To provide an accessible way to see and understand trends, outliers, and patterns in data
- To try to confirm a hypothesis
Chat Activity: Explore a Dashboard
- What is a dashboard? It is a visual display of all your data
- Let’s assume you work at a recruitment firm, and the firm has a dashboard to track and view its KPIs (Key Performance Indicators)
- What KPIs would you like to track using that dashboard to help make better business decisions?
- Share your thoughts in chat
![centered]()
Exploratory data analysis (EDA)
- Exploratory data analysis (EDA) is the process of reviewing new data to discover patterns, spot anomalies, test hypotheses, and check assumptions
- It helps to be able to create graphs without breaking your train of thought as you explore your data
- Visualization is an iterative process and consists of a few steps:
  - Analyze
  - Manipulate
  - Graph
  - Repeat
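The analyze, manipulate, graph, repeat loop can be sketched in a few lines. This is a minimal illustration, assuming pandas and matplotlib are installed; the dataset, column names, and values are made up for the example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data: monthly operations per department (illustrative values only)
df = pd.DataFrame({
    "department": ["A", "A", "B", "B", "C", "C"],
    "month": ["Jan", "Feb", "Jan", "Feb", "Jan", "Feb"],
    "operations": [120, 135, 80, None, 95, 110],
})

# 1. Analyze: summary statistics and a count of missing values
summary = df["operations"].describe()
n_missing = int(df["operations"].isna().sum())

# 2. Manipulate: drop incomplete rows, then aggregate by department
clean = df.dropna(subset=["operations"])
totals = clean.groupby("department")["operations"].sum()

# 3. Graph: plot the aggregated result
fig, ax = plt.subplots()
totals.plot(kind="bar", ax=ax)
ax.set_xlabel("Department")
ax.set_ylabel("Total operations")
ax.set_title("Operations by department")

# 4. Repeat: what the chart shows (here, a gap for department B)
#    suggests the next question to analyze.
```

In practice each pass through the loop is short: a chart raises a question, and the next round of analysis or manipulation tries to answer it.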
Exploratory data analysis in Python
- Python is a powerful tool for EDA because the graphics tie in with the functions used to analyze data
- What is possible using Python?
  - Visualization tools available through multitudes of packages (e.g., matplotlib, seaborn)
  - The visualizations created are high quality graphics that can be saved as SVG, PNG, JPEG, BMP, PDF
- Visualizations are often the best way to display patterns in data for printed publications
- Next, we will explore how to visualize data using Python and perform exploratory data analysis to detect and understand patterns
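As a rough sketch of saving a figure for publication, matplotlib's `savefig` picks the output format from the file extension. The example below writes PNG, SVG, and PDF, which matplotlib handles on its own; the file name `example` and the plotted values are placeholders, and the files go to a temporary directory.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

# A small placeholder chart
fig, ax = plt.subplots()
ax.plot([1, 2, 3, 4], [10, 20, 15, 30], marker="o")
ax.set_xlabel("x")
ax.set_ylabel("y")

# Save the same figure in several formats; the extension selects the format
out_dir = tempfile.mkdtemp()
saved_paths = []
for ext in ("png", "svg", "pdf"):
    path = os.path.join(out_dir, f"example.{ext}")
    fig.savefig(path, dpi=150, bbox_inches="tight")
    saved_paths.append(path)

plt.close(fig)
```

Vector formats such as SVG and PDF scale without losing sharpness, which is usually what printed publications want; PNG is a raster format, so `dpi` controls its resolution.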
Module completion checklist
| Objective | Complete |
| --- | --- |
| Discuss data visualization and exploratory data analysis | ✔ |
| Describe chart types by data and form | |
Getting started with data viz
- Deciding on what visualization type to use will depend on the data and message you want to communicate
Categorical Data
- Categorical data is non-numeric or qualitative
- Insight: comparisons and proportions
- Chart types: vertical bar (column), horizontal bar, pie, bullet, stacked bar, and treemap
Univariate Data
- Univariate data consists of a single numeric variable
- Insight: distributions, proportions, and frequencies
- Chart types: histogram, density, box plots
- Is the data normally distributed?
- Are there any outliers?
- Do you notice any other patterns in the data?
- Answering these questions is one of the first steps of initial data exploration
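The questions above can be explored with a histogram and a box plot. The sketch below uses synthetic data with two extreme values added on purpose, and flags outliers with the common 1.5 × IQR rule; that rule is one convention among several, chosen here only for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Synthetic univariate data: roughly normal values plus two planted extremes
rng = np.random.default_rng(seed=0)
values = np.append(rng.normal(loc=50, scale=5, size=200), [95.0, 99.0])

# Flag outliers with the 1.5 * IQR rule (the same rule box plot whiskers use by default)
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]

# Histogram for the shape of the distribution, box plot for spread and outliers
fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(8, 3))
ax_hist.hist(values, bins=20)
ax_hist.set_title("Histogram")
ax_box.boxplot(values)
ax_box.set_title("Box plot")
fig.tight_layout()
```

A roughly symmetric, bell-shaped histogram suggests the data may be close to normal, while points drawn beyond the box plot whiskers are candidates for a closer look.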
Bivariate Data
- Bivariate data consists of two numeric variables (e.g., weight and height); data with more than two is multivariate
- Insight: relationships, correlation, proportions, and frequencies
- Chart types: scatterplot, bubble, parallel coordinates, radar, bullet, and heatmap
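A scatterplot paired with a correlation coefficient is a common first look at bivariate data. The sketch below generates synthetic height and weight values (the numbers are invented, not real measurements) and computes Pearson's r with `numpy.corrcoef`.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data: weight loosely increases with height, plus random noise
rng = np.random.default_rng(seed=1)
height_cm = rng.normal(170, 10, size=100)
weight_kg = 0.9 * height_cm - 85 + rng.normal(0, 5, size=100)

# Pearson correlation coefficient between the two variables
r = np.corrcoef(height_cm, weight_kg)[0, 1]

# Scatterplot showing the relationship
fig, ax = plt.subplots()
ax.scatter(height_cm, weight_kg, alpha=0.7)
ax.set_xlabel("Height (cm)")
ax.set_ylabel("Weight (kg)")
ax.set_title(f"Pearson r = {r:.2f}")
```

Pearson's r only measures linear association, so the plot itself remains the better check for curved or clustered relationships.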
Trend Data
- Trend data includes a time-based variable (e.g., years, months, days, or hours)
- Insight: trends, comparisons, and cycles
- Chart types: line, area, bubble, vertical bar
Text Data
- Text data includes alphanumeric single words or phrases (keywords)
- Insight: sentiment, comparisons, and frequency
- Chart types: word cloud, histogram, stacked bar chart
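Word frequencies are the raw material for most text charts. The sketch below counts words with the standard library's `collections.Counter` and draws them as a horizontal bar chart, used here as a simple stand-in for a word cloud (which would need a third-party package). The sample text and the tiny stop-word list are made up for illustration.

```python
import re
from collections import Counter

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Illustrative text; a real analysis would load documents or survey answers
text = (
    "Data visualization makes data easier to read. "
    "Good visualization turns raw data into insight."
)

# Tokenize into lowercase words and drop a few uninformative ones
words = re.findall(r"[a-z]+", text.lower())
stop_words = {"to", "into", "makes", "turns"}  # tiny illustrative list
counts = Counter(w for w in words if w not in stop_words)
top = counts.most_common(5)

# Horizontal bars keep the word labels readable
labels, freqs = zip(*top)
fig, ax = plt.subplots()
ax.barh(labels, freqs)
ax.invert_yaxis()  # most frequent word at the top
ax.set_xlabel("Frequency")
```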
Geospatial Data
- Geospatial data includes qualitative or quantitative information about specific locations
- Insight: locations, comparisons, and trends
- Chart types: choropleth (filled) map, point map, connection map, isopleth map
Simple text or table
- Tables are helpful when communicating to a mixed audience or showing a few different units of measure
Bar Chart
- Bar charts are used to show larger variations in data, how individual data points relate to a whole, comparisons, and rankings
- They express quantities through a bar's length, measured from a common baseline (zero)
Note: when the data has lengthy names, using a horizontal bar chart will make the data easier to read
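A minimal sketch of the two points above: bars start from a zero baseline, and a horizontal layout leaves room for long category names. The categories and counts are hypothetical placement figures for a recruitment firm, invented for the example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Hypothetical categories with lengthy names and made-up counts
categories = [
    "Software engineering",
    "Customer support",
    "Sales and marketing",
    "Human resources",
]
placements = [42, 17, 29, 8]

# Sort ascending so the largest bar ends up at the top of a horizontal chart
order = sorted(range(len(placements)), key=lambda i: placements[i])
labels = [categories[i] for i in order]
values = [placements[i] for i in order]

fig, ax = plt.subplots()
ax.barh(labels, values)
ax.set_xlim(left=0)  # keep the common zero baseline explicit
ax.set_xlabel("Placements")
fig.tight_layout()
```

Sorting the bars turns the chart into a ranking, which is often the comparison a reader is looking for.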
Line Chart
- Line charts are used to plot continuous data in some unit of time, such as days, months, quarters or years
- They can also be used to show multiple series of data
- A line graph can also represent a summary statistic, such as an average with its confidence interval, or the point estimate of a forecast
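One way to draw a summary statistic over time is a line for the average with a shaded band for its uncertainty. The sketch below uses synthetic monthly samples and a normal-approximation 95% interval (mean ± 1.96 standard errors); both the data and the choice of interval are assumptions made for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data: 30 measurements per month scattered around a rising trend
rng = np.random.default_rng(seed=2)
months = np.arange(1, 13)
samples = 100 + 5 * months[:, None] + rng.normal(0, 10, size=(12, 30))

# Monthly average and an approximate 95% confidence interval
mean = samples.mean(axis=1)
sem = samples.std(axis=1, ddof=1) / np.sqrt(samples.shape[1])
ci_low, ci_high = mean - 1.96 * sem, mean + 1.96 * sem

# Line for the average, shaded band for the interval
fig, ax = plt.subplots()
ax.plot(months, mean, label="Monthly average")
ax.fill_between(months, ci_low, ci_high, alpha=0.3, label="Approx. 95% CI")
ax.set_xlabel("Month")
ax.set_ylabel("Value")
ax.legend()
```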
Area Chart
- Area charts are used to summarize relationships between datasets and to show how individual data points relate to a whole
- The visual at the right shows the monthly trend of active operations
- In chat, share your thoughts on how you think this visual could be improved
Heatmap
- Heatmaps visualize data in tabular format, using colored cells to show the relative magnitude of the numbers
- When using a heatmap, it is helpful to restrict the number of different color gradations
- The visual at the right shows the busiest months ranked by the number of operations for each department
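A heatmap with a limited number of color steps can be sketched with matplotlib's `imshow`. Passing a second argument to `plt.get_cmap` resamples the colormap to that many discrete colors. The department names and operation counts below are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical table: rows are departments, columns are months
departments = ["Cardiology", "Orthopedics", "Neurology"]
months = ["Jan", "Feb", "Mar", "Apr"]
operations = np.array([
    [30, 45, 50, 38],
    [22, 18, 27, 35],
    [12, 15, 9, 20],
])

# Restrict the colormap to 5 gradations so cells are easy to tell apart
cmap = plt.get_cmap("Blues", 5)

fig, ax = plt.subplots()
image = ax.imshow(operations, cmap=cmap, aspect="auto")
ax.set_xticks(range(len(months)))
ax.set_xticklabels(months)
ax.set_yticks(range(len(departments)))
ax.set_yticklabels(departments)
fig.colorbar(image, ax=ax, label="Operations")
```

With only a handful of color steps, readers compare bands of magnitude rather than trying to judge subtle shade differences.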
Scatterplot
- Scatterplots show the type of relationship between two numeric variables
- Scatterplots are often used in scientific fields and are sometimes viewed as “complicated” to understand, but there are real-world uses as well
- In chat, share your thoughts on what relationship this scatterplot represents
Review Quiz: Data Visualization
- Question 1: Which of the two graphs makes it easier to determine which investment has the larger market share?
- Share your answer in the chat
![centered]()
Review Quiz: Data Visualization
- Question 2: What is this graph called?
- Share your answer in the chat
![centered]()
Review Quiz: Data Visualization
- Question 3: Which graph represents the values accurately? Why do you think so?
- Share your answer in the chat
![centered]()
Review Quiz: Data Visualization
- Question 4: What type of graph is this?
- Share your answer in the chat
![centered]()
Review Quiz: Data Visualization
- Question 5: Which of the following graphs focuses on trends rather than individual values?
- Share your answer in the chat
![centered]()
Module completion checklist
| Objective | Complete |
| --- | --- |
| Discuss data visualization and exploratory data analysis | ✔ |
| Describe chart types by data and form | ✔ |
Congratulations on completing this module!